Natural Language Processing insights from the National Innovation Centre for Data

👤 About me – Mac Misiura

  • Obtained a PhD in Applied Mathematics and Statistics from Newcastle University
  • Joined the National Innovation Centre for Data as a Data Scientist in 2021
  • Particularly interested in:
    • Generative AI, focussing on (open source) Large Language Models
    • AI Assurance
  • Anticipates becoming a prompt engineer in the near future 😆

images generated using textual-inversion fine-tuning for Stable Diffusion | model with a learnt concept of Me is available on Hugging Face

🌇 About National Innovation Centre for Data

The National Innovation Centre for Data runs projects with organisations to help them acquire new skills and innovate through data

🌠 Collaborators

📚 Theory

What is natural language processing?

Natural language processing is a subfield of Artificial Intelligence (AI) concerned with developing systems that deal with natural language

What is language?

Language is a structured system of communication containing the following elements:

  • a collection of principles describing how to create appropriate utterances (grammar)
  • a set of words relating to the world (vocabulary)

What is natural language?

Natural language is any language that occurs naturally in a human community through a process of use, repetition, and change, without conscious planning or premeditation

What is NOT a natural language?

The following communication systems are not considered natural languages

  • constructed languages, including:

    • fictional languages, such as 👽 Klingon
    • programming languages, such as 🐍 Python
    • international auxiliary languages, such as 🎏 Esperanto
  • non-human communication systems, such as 🐝 bee dancing

📜 Ancient history of natural language processing

Phase 1: 1940s-1960s

  • Research focussed on machine translation (MT)
  • In 1954, the Georgetown–IBM experiment automatically translated 60 Russian sentences to English
  • In 1966, the ALPAC Report concluded that MT did not seem feasible

Phase 2: 1960s-1970s

  • Research focussed on building and querying knowledge bases
  • In 1961, the BASEBALL system was developed to answer questions about baseball
  • In 1966, the ELIZA system simulated a Rogerian psychotherapist
  • In 1970, the SHRDLU system simulated a robot which manipulated blocks on a table top with instructions given in English

📜 Ancient history of natural language processing

Phase 3: 1970s-1990s

  • Initially, research continued to focus on rule-based systems, with an emphasis on syntactic and semantic analysis
  • But, by the late 1980s, researchers became more united in focusing on empiricism and probabilistic models (e.g. Hidden Markov Models)
  • Notable progress was made on practical tasks such as speech recognition and automatic summarisation
  • A major trend towards systematic performance evaluation emerged

Phase 4: 1990s-2017

  • In 2003, Bengio and colleagues suggested that neural networks could be used to model natural language
  • In 2013, Word2vec introduced a novel way to represent words as dense, continuous-valued vectors in a high-dimensional space, which was notably better than earlier word representation attempts (one-hot encoding or bag-of-words)
  • In 2017, ELMo introduced the concept of contextualised word embeddings

🚀 Phase 5: 2017-present

This NLP phase is characterised by finding the winning recipe for building a good language model:

\[\begin{equation} \boxed{ \begin{array}{c} \textit{winning recipe} \\ = \\ \textbf{huge amounts of easy to acquire data} \\ \times \\ \textbf{a simple, high-throughput way to consume it} \end{array} } \end{equation}\]

💡 Breakthrough 1: subword tokenisation and dense embeddings

Most language models require numerical inputs and thus, text needs to be pre-processed into the expected model format. Text pre-processing focuses on:

  • splitting the input text into chunks (tokens)
  • converting each token to an integer (token ids) via look-up tables
  • mapping token ids to dense, continuous-valued vectors (embeddings)

💡 Breakthrough 1: subword tokenisation and dense embeddings

from transformers import AutoTokenizer

# Define input text and checkpoint
input_text = "NLP is the most interesting subfield of AI"
checkpoint = "distilbert-base-uncased"

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Tokenize the input text
input_tokens = tokenizer.tokenize(input_text)

# Convert tokens to IDs
input_ids = tokenizer.convert_tokens_to_ids(input_tokens)

# Display results
result_text = (
    f"Input Text:           {input_text}\n"
    f"Tokenized Text:       {input_tokens}\n"
    f"Token IDs:            {input_ids}"
)
print(result_text)
Input Text:           NLP is the most interesting subfield of AI
Tokenized Text:       ['nl', '##p', 'is', 'the', 'most', 'interesting', 'sub', '##field', 'of', 'ai']
Token IDs:            [17953, 2361, 2003, 1996, 2087, 5875, 4942, 3790, 1997, 9932]

💡 Breakthrough 1: subword tokenisation and dense embeddings

import torch
from transformers import AutoTokenizer, AutoModel

input_text = "NLP is the most interesting subfield of AI"

# Initialize the tokenizer and model
checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

# Tokenize the input text
input_tokens = tokenizer(input_text, return_tensors="pt")

# Get the input embeddings
input_embeddings = model.get_input_embeddings()
embeddings = input_embeddings(input_tokens["input_ids"])

# Display the actual embeddings and their shape
print(f"Shape:      {embeddings.shape}\n" f"Embeddings: {embeddings}")
Shape:      torch.Size([1, 12, 768])
Embeddings: tensor([[[ 0.0390, -0.0123, -0.0208,  ...,  0.0607,  0.0230,  0.0238],
         [-0.0207, -0.1044, -0.0330,  ..., -0.0648, -0.0250, -0.0674],
         [-0.0398, -0.0356,  0.0207,  ..., -0.0304, -0.0539, -0.0576],
         ...,
         [-0.0415, -0.0041,  0.0112,  ...,  0.0049, -0.0181, -0.0023],
         [-0.0699, -0.0343, -0.0272,  ...,  0.0480, -0.0406, -0.0381],
         [-0.0199, -0.0095, -0.0099,  ..., -0.0235,  0.0071, -0.0071]]],
       grad_fn=<EmbeddingBackward0>)

Why tokenise at the subword level?

There are many advantages to subword-level tokenisation, including:

  • generalisation to unseen words (variations, misspellings, novel types)
  • language agnosticism
  • efficient use of memory
  • data-driven adaptation to different domains
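
A toy illustration of the first advantage (this is a simplified sketch of WordPiece-style longest-match-first tokenisation with an invented vocabulary, not BERT's real algorithm or vocabulary):

```python
def wordpiece_tokenise(word, vocab):
    """Greedy longest-match-first tokenisation, as in WordPiece.

    Continuation pieces are prefixed with '##'; words containing
    no known piece fall back to a single [UNK] token.
    """
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]
        tokens.append(piece)
        start = end
    return tokens


# Toy vocabulary (made up for illustration)
vocab = {"token", "##isa", "##tion", "##s", "sub", "##word", "un", "##seen"}

# Words never stored whole in the vocabulary still get a representation
print(wordpiece_tokenise("tokenisation", vocab))  # ['token', '##isa', '##tion']
print(wordpiece_tokenise("subwords", vocab))      # ['sub', '##word', '##s']
```

Because unknown words decompose into known pieces, the tokeniser rarely has to emit [UNK], which is what gives subword models their robustness to variations, misspellings and novel words.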

💡 Breakthrough 2: self-supervised learning

💡 Breakthrough 2: self-supervised learning

Currently, state-of-the-art language models are pre-trained using self-supervised learning, usually via language modelling:

from transformers import pipeline

# Define input text
input_text = "The goal of this workshop is to <mask> the audience about the power of NLP"

# Initialise the pipeline for masked language modelling
model = pipeline("fill-mask")

# Get the results
results = model(input_text)

# Display results
for i in results:
    print(f"{i['sequence']} \t {round(i['score'], 3)}")
The goal of this workshop is to educate the audience about the power of NLP      0.64
The goal of this workshop is to teach the audience about the power of NLP    0.278
The goal of this workshop is to inform the audience about the power of NLP   0.043
The goal of this workshop is to remind the audience about the power of NLP   0.011
The goal of this workshop is to engage the audience about the power of NLP   0.005

What does pre-training learn?

While language modelling appears simple, it is a very powerful technique to learn a wide range of things since input sequences can contain any type of information, for example:

# Example 1: Syntax
"The quick brown fox jumps <mask> the lazy dog"

# Example 2: Trivia / Knowledge
"Newcastle University is located in <mask>"

# Example 3: Sentiment
"I've never laughed so much during a trip to the cinema. The Barbie movie was <mask>"

# Example 4: Coreference
"Will Ferrell stole the show in this movie. <mask> is such a good actor"

# Example 5: Mathematics
"I was thinking about the sequence that goes 1, 1, 2, 3, 5, 8, 13, 21, <mask>"

Where does the data come from?

The data used to pre-train language models is usually obtained from the internet, for example from web crawls (e.g. Common Crawl), Wikipedia, books and code

💡 Breakthrough 3: finding promising architectures

Transformers are particularly successful due to:

  • attention: captures contextual relationships between words, allowing the model to weigh the importance of each word in the context of the entire input
  • positional encodings: retain information about the order of words in the input sequence, enabling effective handling of sequential information
  • parallel computation: unlike sequential networks (e.g. RNNs), Transformers can process the entire input sequence simultaneously, making training and inference computationally efficient

figure taken from Attention Is All You Need
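
The attention and positional-encoding ingredients can be sketched in a few lines of NumPy. This is a toy, single-head version for illustration, not a full multi-head Transformer block:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings from 'Attention Is All You Need'."""
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]           # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dims: sine
    pe[:, 1::2] = np.cos(angles)                   # odd dims: cosine
    return pe

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True) # row-wise softmax
    return weights @ V, weights

# Toy example: 4 tokens with 8-dimensional embeddings
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8)) + positional_encoding(4, 8)
output, weights = scaled_dot_product_attention(x, x, x)

print(weights.shape)  # (4, 4): one weight per pair of tokens
```

Every token attends to every other token in one matrix product, which is what makes the whole sequence processable in parallel.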

What makes a language model large?

Recall that a language model is a model that learns to fill in the blanks

The definition of a large language model is rather fuzzy and there are different ways to pin it down, e.g. based on:

  • the number of parameters
  • the amount of data used to train the model
  • the ability to carry out a wide range of tasks

👷 Large language model workflow

Large language models could be viewed as a subset of foundation models, which describe the paradigm shift within AI from developing task-specific models trained on narrow data to developing multi-purpose models trained on broad data

😱 Large Language Models for everything: size matters

💣 Potential paradigm shift

In 2020, Brown and colleagues reported the following:

Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3’s few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora

🔍 Scaling laws

In 2020, J. Kaplan and colleagues demonstrated that the performance of Large Language Models appears to improve with scaling:

  • compute
  • data
  • model size

figure taken from Scaling Laws for Neural Language Models
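
These trends are empirical power laws. As a sketch, the model-size law takes the form L(N) = (N_c / N)^α_N; the constants below (α_N ≈ 0.076, N_c ≈ 8.8 × 10^13) are the values reported in the paper, quoted here purely for illustration:

```python
def loss_from_model_size(n_params, n_c=8.8e13, alpha_n=0.076):
    """Kaplan et al.'s power law for test loss as a function of
    (non-embedding) parameter count: L(N) = (N_c / N) ** alpha_N."""
    return (n_c / n_params) ** alpha_n

# Loss keeps falling smoothly as models grow by orders of magnitude
for n in [1e8, 1e9, 1e10, 1e11]:
    print(f"N = {n:.0e}  ->  predicted loss {loss_from_model_size(n):.3f}")
```

Note how small the exponent is: each 10× increase in parameters only shaves a constant factor off the loss, which is why frontier models grow by orders of magnitude rather than incrementally.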

🔍 Scaling laws – model size

figure taken from ChatGPT, GenerativeAI and Large Language Models Timeline

🔍 Scaling laws – data size

figure taken from BabyLM Challenge

🐣 Emergent abilities in Large Language Models

  • Breakthroughs 1-3 have had a notable impact on the performance of (generative) Large Language Models which are now state-of-the-art for most benchmark NLP tasks

  • Moreover, in 2022, J. Wei and colleagues observed emergent abilities on many additional tasks

figure taken from Emergent Abilities of Large Language Models

🐣 Emergent abilities in Large Language Models

animation taken from Pathways Language Model (PaLM): Scaling to 540 Billion Parameters for Breakthrough Performance

🐣 Emergent abilities in Large Language Models

😲 Some Large Language Models appear to be capable of identifying movies from emojis 😲

figure taken from Beyond the Imitation Game benchmark

What (somewhat useful) tasks can large language models actually perform?

Large language models can perform a wide range of tasks; some of the most popular include:

  • classification
  • summarisation
  • translation
  • question answering
  • generation

♦️ ♠️ ♥️ ♣️ Classification

Classification is the task of assigning a label to either a sequence of tokens or each token in a sequence

Sequence classification can be used for:

  • sentiment analysis: identifying sentiments or polarity of a text
  • grammatical correctness: identifying whether a sentence follows grammatical rules

Token classification can be used for:

  • named entity recognition: identifying entities in text
  • part-of-speech tagging: identifying grammatical categories of words

💬 Sequence classification – sentiment analysis

Using the Hugging Face Pipeline API:

from transformers import pipeline

input_text = "American Football is the best sport in the world"
sentiment_model = pipeline("sentiment-analysis")
output = sentiment_model(input_text)

# display results
f"Sentiment is: {output[0]['label']} with a score of {output[0]['score']}"
'Sentiment is: POSITIVE with a score of 0.9998629093170166'

🔫 Sequence classification – zero-shot classification with custom labels

Using the Hugging Face Pipeline API:

from transformers import pipeline

input_text = "I like napping"
zero_shot_model = pipeline("zero-shot-classification")
candidate_labels = ["sleeping", "eating", "working"]
output = zero_shot_model(input_text, candidate_labels=candidate_labels)

# display results
for label, score in zip(output["labels"], output["scores"]):
    print(f"{label}: {score}")
sleeping: 0.9849140048027039
working: 0.01279028132557869
eating: 0.0022957399487495422

💃 👽 🐻 Token classification – named entity recognition

Using the Hugging Face Pipeline API:

from transformers import pipeline

input_text = "Newcastle University is located in Newcastle upon Tyne"
ner_model = pipeline("ner")
output = ner_model(input_text, aggregation_strategy='simple')

# display results
for i in output:
    print(f"{i['entity_group']}: {i['word']}")
ORG: Newcastle University
LOC: Newcastle upon Tyne

📌 Summarisation

Summarisation is the task of producing a shorter version of text while retaining its key points

There are two key variants of this task:

  • extractive summarisation: selecting spans of text from the input text to form a summary
  • abstractive summarisation: generating new text conditional on input text to form a summary
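
As a toy illustration of the extractive variant (a made-up frequency-based scorer, not what the Hugging Face summarisation pipeline shown later uses, which is abstractive), sentences can be scored by how frequent their words are in the whole text and the top-scoring ones kept:

```python
import re
from collections import Counter

def extractive_summary(text, n_sentences=1):
    """Score each sentence by the corpus frequency of its words,
    then keep the top-scoring sentences in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    return " ".join(s for s in sentences if s in top)

text = (
    "American football is a team sport. "
    "The offense attempts to advance the football down the field. "
    "Points are scored by advancing the football into the end zone."
)
print(extractive_summary(text, n_sentences=1))
```

Real extractive systems use far better sentence scorers (e.g. TextRank or trained classifiers), but the structure is the same: select, never rewrite.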

🏈 Summarisation – examples

Consider the following input text:

# define input text
input_text = """
American football (referred to simply as football in the United States and Canada), also known as gridiron, is a team sport played by two teams of eleven players on a rectangular field with goalposts at each end. The offense, the team with possession of the oval-shaped football, attempts to advance down the field by running with the ball or passing it, while the defense, the team without possession of the ball, aims to stop the offense's advance and to take control of the ball for themselves. The offense must advance at least ten yards in four downs or plays; if they fail, they turn over the football to the defense, but if they succeed, they are given a new set of four downs to continue the drive. Points are scored primarily by advancing the ball into the opposing team's end zone for a touchdown or kicking the ball through the opponent's goalposts for a field goal. The team with the most points at the end of a game wins.

American football evolved in the United States, originating from the sports of soccer and rugby. The first American football match was played on November 6, 1869, between two college teams, Rutgers and Princeton, using rules based on the rules of soccer at the time. A set of rule changes drawn up from 1880 onward by Walter Camp, the "Father of American Football", established the snap, the line of scrimmage, eleven-player teams, and the concept of downs. Later rule changes legalized the forward pass, created the neutral zone and specified the size and shape of the football. The sport is closely related to Canadian football, which evolved in parallel with and at the same time as the American game, although its rules were developed independently from those of Camp. Most of the features that distinguish American football from rugby and soccer are also present in Canadian football. The two sports are considered the primary variants of gridiron football.

American football is the most popular sport in the United States in terms of broadcast viewership audience. The most popular forms of the game are professional and college football, with the other major levels being high-school and youth football. As of 2012, nearly 1.1 million high-school athletes and 70,000 college athletes play the sport in the United States annually. The National Football League, the most popular American professional football league, has the highest average attendance of any professional sports league in the world. Its championship game, the Super Bowl, ranks among the most-watched club sporting events in the world. The league has an annual revenue of around US$15 billion, making it the most valuable sports league in the world. Other professional leagues exist worldwide, but the sport does not have the international popularity of other American sports like baseball or basketball.
"""

🔈 Summarisation – abstractive

Using the Hugging Face Pipeline API:

from transformers import pipeline

# define the default abstractive summarisation pipeline
summary_model = pipeline("summarization")
output = summary_model(input_text, min_length=10, max_length=100)

# display results
output[0]["summary_text"]
" American football is a team sport played by two teams of eleven players on a rectangular field with goalposts at each end . The offense, the team with possession of the football, attempts to advance down the field by running with the ball or passing it, while the defense aims to stop the offense's advance . First American football match was played on November 6, 1869, between two college teams, Rutgers and Princeton, using rules based on the rules of soccer at the time ."

Question answering

Question answering is the task of retrieving an answer to a question posed in natural language

There are three key variants of this task:

  • extractive: the answer is a span of text from the context
  • open generative: the answer is free text based on the context
  • closed generative: the answer is free text; no context is provided

Note that the context can be either structured (e.g. tabular) or unstructured (e.g. textual)

Question answering – extractive

Textual context:

from transformers import pipeline

context = "My name is Larry the Cat and I live at 10 Downing Street"
qa_model = pipeline("question-answering")
question = "Where do I live?"
output = qa_model(question=question, context=context)

# display results
f"Answer: {output['answer']}"
'Answer: 10 Downing Street'

Question answering – extractive

Tabular context:

           players titles
0  Patrick Mahomes      2
1        Tom Brady      7
2    Aaron Rodgers      1
3      Brock Purdy      0
from transformers import pipeline
import pandas as pd

# prepare table + question
dictionary = {
    "players": ["Patrick Mahomes", "Tom Brady", "Aaron Rodgers", "Brock Purdy"], 
    "titles": ["2", "7", "1", "0"]
}
context = pd.DataFrame.from_dict(dictionary)
question = "which player have the most Super Bowls?"

# pipeline model
tqa_model = pipeline("table-question-answering")
output = tqa_model(table=context, query=question)

# display results
f"Answer: {output['cells'][0]}"
'Answer: Tom Brady'

Question answering – extractive

Image context:

from transformers import pipeline
from PIL import Image

checkpoint = "naver-clova-ix/donut-base-finetuned-docvqa"
image_qa = pipeline("document-question-answering", model=checkpoint)

question = "What is the total?"
image = Image.open("img/realistic-receipt.jpeg")
output = image_qa(image=image, question=question)

# display results
print(output)
[{'answer': '16.5'}]

📖 Question answering – open generative

Using the Hugging Face Pipeline API

from transformers import pipeline

# define the text2text-generation pipeline
text2text_generator = pipeline("text2text-generation")
question = "What is 42?"
context = "42 is the answer to life, the universe and everything"
output = text2text_generator(f"question: {question} context: {context}")

# display results
output[0]['generated_text']
'the answer to life, the universe and everything'

📕 Question answering – closed generative

Using the Hugging Face Pipeline API

from transformers import pipeline

# define question
input_text = "What is 42 ?"

# define the text generation pipeline
checkpoint = "EleutherAI/gpt-neo-125m"
text_generation_model = pipeline("text-generation", model=checkpoint)
output = text_generation_model(input_text, max_new_tokens=20)

# display results
output[0]["generated_text"]
'What is 42 ?\n42\nWhat is the units digit of (2 - -1) + -1 + -'

👕 Retrieval Augmented Generation (RAG)

RAG is a technique for augmenting LLM knowledge with additional data using the following two steps:

Step 1: create an index of documents

👕 Retrieval Augmented Generation (RAG)

RAG is a technique for augmenting LLM knowledge with additional data using the following two steps:

Step 2:

  • retrieval: retrieve relevant documents from the index
  • generation: generate an answer based on the retrieved documents
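
The two steps can be sketched end to end with a toy retriever. Here the "embedding" is just a set of words and similarity is word overlap, standing in for a real vectorstore with dense embeddings, and the generation step is only a prompt template rather than an actual LLM call:

```python
import re

def embed(text):
    """Toy 'embedding': the set of lowercase words (stand-in for a dense vector)."""
    return set(re.findall(r"[a-z]+", text.lower()))

def similarity(a, b):
    """Jaccard overlap between two word sets."""
    return len(a & b) / len(a | b)

# Step 1: index the documents (a real system would store dense vectors)
documents = [
    "The National Innovation Centre for Data is based in Newcastle",
    "Transformers use attention and positional encodings",
    "RAG retrieves documents and generates answers from them",
]
index = [(doc, embed(doc)) for doc in documents]

# Step 2a: retrieval - rank the indexed documents against the query
query = "Where is the National Innovation Centre for Data?"
q = embed(query)
best_doc = max(index, key=lambda item: similarity(q, item[1]))[0]

# Step 2b: generation - stuff the retrieved context into the LLM prompt
prompt = f"Answer using only this context:\n{best_doc}\n\nQuestion: {query}"
print(prompt)
```

The prompt, rather than the model's weights, now carries the knowledge needed to answer, which is the whole point of RAG.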

👕 Retrieval Augmented Generation (RAG)

There is a lot of nuance to RAG and it is a rather active area of research; for more details, see this blog post

💎 RAG components

🏆 Choosing the right vectorstore

There are (too) many vectorstores available; some of the most popular open-source ones include:

  1. chroma
  2. pgvector

They both support exact and approximate nearest-neighbour search (L2 distance, inner product, and cosine distance).

They also support maximal marginal relevance (MMR) re-ranking, which is a technique for re-ranking search results to improve diversity.
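
MMR trades relevance off against redundancy: at each step it selects the document maximising λ · sim(query, doc) − (1 − λ) · max over already-selected docs of sim(doc, selected). A sketch over made-up similarity scores:

```python
def mmr(query_sims, doc_sims, k=2, lam=0.5):
    """Maximal marginal relevance re-ranking.

    query_sims: list of sim(query, doc_i)
    doc_sims:   matrix of pairwise sim(doc_i, doc_j)
    Returns the indices of k documents, in selection order.
    """
    selected, remaining = [], list(range(len(query_sims)))
    while remaining and len(selected) < k:
        def mmr_score(i):
            redundancy = max((doc_sims[i][j] for j in selected), default=0.0)
            return lam * query_sims[i] - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Docs 0 and 1 are both relevant but near-duplicates; doc 2 is less
# relevant but different, so MMR picks it second for diversity
query_sims = [0.9, 0.85, 0.6]
doc_sims = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
print(mmr(query_sims, doc_sims, k=2, lam=0.5))  # [0, 2]
```

With λ = 1, MMR reduces to plain relevance ranking; lowering λ increasingly penalises documents similar to ones already picked.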

🏆 Choosing the right model

New large language models are released almost on a weekly basis, making it difficult to pick the appropriate model

To choose the right model, it is important to consider:

  • the task(s) that you want to perform
  • licenses
  • infrastructure

📅 Keeping track of new models – model list

📅 Benchmark for retriever models

📅 Benchmark for generator models

📅 Benchmark for generator models

📅 Keeping track of new models – benchmarks